Load necessary packages
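
The original chunk is not shown here, so the package list below is an assumption based on the functions these notes mention (skim-style summaries, fct_relevel(), case_when(), str_remove(), ggplot2).

```r
# Assumed package set; the original chunk is not shown in these notes
library(dplyr)    # mutate(), filter(), summarise(), case_when()
library(tidyr)    # gather(), pivot_longer()
library(readr)    # read_csv(), parse_number()
library(forcats)  # fct_relevel(), fct_collapse(), fct_lump()
library(stringr)  # str_remove()
library(ggplot2)  # plotting
library(skimr)    # skim() data summaries
```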

Explore your dataset

Load dataset and take a glimpse
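
A minimal sketch of this step. The real file path is not given in these notes, so the example writes a tiny made-up CSV to a temporary file (toy_path and toy_responses are hypothetical names) and reads it back; the column-specification message and the "Data summary" table are the kind of output read_csv() and skim() produce.

```r
library(readr)
library(skimr)

# Hypothetical stand-in for the survey file: three made-up rows in a temp CSV
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "CurrentJobTitleSelect,FormalEducation,Age",
  "Data Scientist,Master's degree,30",
  "Engineer,Bachelor's degree,25",
  ",Doctoral degree,41"
), toy_path)

# read_csv() guesses column types: text becomes character, Age becomes double
toy_responses <- read_csv(toy_path)

# skim() prints per-type summaries (missing counts, unique values, quantiles)
skim(toy_responses)
```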

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Age = col_double()
## )
## See spec(...) for full column specifications.
Data summary
Name                     multiple_choice_responses
Number of rows           16716
Number of columns        47
_______________________
Column type frequency:
character                46
numeric                  1
________________________
Group variables          None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
LearningPlatformUsefulnessArxiv 14325 0.14 10 15 0 3 0
LearningPlatformUsefulnessBlogs 11951 0.29 10 15 0 3 0
LearningPlatformUsefulnessCollege 13357 0.20 10 15 0 3 0
LearningPlatformUsefulnessCompany 15735 0.06 10 15 0 3 0
LearningPlatformUsefulnessConferences 14534 0.13 10 15 0 3 0
LearningPlatformUsefulnessFriends 15135 0.09 10 15 0 3 0
LearningPlatformUsefulnessKaggle 10133 0.39 10 15 0 3 0
LearningPlatformUsefulnessNewsletters 15627 0.07 10 15 0 3 0
LearningPlatformUsefulnessCommunities 15574 0.07 10 15 0 3 0
LearningPlatformUsefulnessDocumentation 14395 0.14 10 15 0 3 0
LearningPlatformUsefulnessCourses 10724 0.36 10 15 0 3 0
LearningPlatformUsefulnessProjects 11922 0.29 10 15 0 3 0
LearningPlatformUsefulnessPodcasts 15502 0.07 10 15 0 3 0
LearningPlatformUsefulnessSO 11076 0.34 10 15 0 3 0
LearningPlatformUsefulnessTextbook 12535 0.25 10 15 0 3 0
LearningPlatformUsefulnessTradeBook 16383 0.02 10 15 0 3 0
LearningPlatformUsefulnessTutoring 15290 0.09 10 15 0 3 0
LearningPlatformUsefulnessYouTube 11487 0.31 10 15 0 3 0
CurrentJobTitleSelect 4886 0.71 5 36 0 16 0
MLMethodNextYearSelect 5883 0.65 4 43 0 25 0
WorkChallengeFrequencyPolitics 14036 0.16 5 16 0 4 0
WorkChallengeFrequencyUnusedResults 14972 0.10 5 16 0 4 0
WorkChallengeFrequencyUnusefulInstrumenting 16077 0.04 5 16 0 4 0
WorkChallengeFrequencyDeployment 15869 0.05 5 16 0 4 0
WorkChallengeFrequencyDirtyData 13165 0.21 5 16 0 4 0
WorkChallengeFrequencyExplaining 15131 0.09 5 16 0 4 0
WorkChallengeFrequencyPass 16292 0.03 5 16 0 4 0
WorkChallengeFrequencyIntegration 15744 0.06 5 16 0 4 0
WorkChallengeFrequencyTalent 13720 0.18 5 16 0 4 0
WorkChallengeFrequencyDataFunds 15764 0.06 5 16 0 4 0
WorkChallengeFrequencyDomainExpertise 15308 0.08 5 16 0 4 0
WorkChallengeFrequencyML 15951 0.05 5 16 0 4 0
WorkChallengeFrequencyTools 15537 0.07 5 16 0 4 0
WorkChallengeFrequencyExpectations 15582 0.07 5 16 0 4 0
WorkChallengeFrequencyITCoordination 15547 0.07 5 16 0 4 0
WorkChallengeFrequencyHiringFunds 15429 0.08 5 16 0 4 0
WorkChallengeFrequencyPrivacy 15294 0.09 5 16 0 4 0
WorkChallengeFrequencyScaling 15883 0.05 5 16 0 4 0
WorkChallengeFrequencyEnvironments 15463 0.07 5 16 0 4 0
WorkChallengeFrequencyClarity 14537 0.13 5 16 0 4 0
WorkChallengeFrequencyDataAccess 14526 0.13 5 16 0 4 0
WorkChallengeFrequencyOtherSelect 16439 0.02 5 16 0 4 0
WorkInternalVsExternalTools 9959 0.40 11 45 0 6 0
FormalEducation 1701 0.90 15 65 0 7 0
DataScienceIdentitySelect 4045 0.76 2 22 0 3 0
JobSatisfaction 10039 0.40 1 23 0 11 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 331 0.98 32.37 10.47 0 25 30 37 100 ▁▇▂▁▁

Select the 3 rows with the highest number of levels
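
One way this step could look, assuming the character columns are first converted to factors. The data are made up (ToyColumn is a hypothetical column); "rows" here means rows of a long table with one row per variable.

```r
library(dplyr)
library(tidyr)

# Made-up data; only the column names echo the survey
toy_responses <- tibble(
  ToyColumn                 = c("a", "b", "c", "d"),
  CurrentJobTitleSelect     = c("Data Scientist", "Engineer", "Data Scientist", "Statistician"),
  FormalEducation           = c("Master's degree", "Bachelor's degree", "Master's degree", "Master's degree"),
  DataScienceIdentitySelect = c("Yes", "Yes", "Yes", "Yes")
)

# Convert every character column to a factor
toy_factors <- toy_responses %>%
  mutate(across(where(is.character), as.factor))

# One row per variable, holding its number of levels
number_of_levels <- toy_factors %>%
  summarise(across(everything(), nlevels)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "num_levels")

# The 3 variables with the most levels
top3 <- number_of_levels %>% slice_max(num_levels, n = 3)
top3

# Number of levels of a single variable
job_levels <- number_of_levels %>%
  filter(variable == "CurrentJobTitleSelect") %>%
  pull(num_levels)
job_levels
```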

## [1] 16

pull() the column CurrentJobTitleSelect and get the values of its levels.
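
A small sketch on a made-up factor column: pull() extracts the column as a vector and levels() lists its levels, which are sorted alphabetically by default.

```r
library(dplyr)

# Made-up values; the real column has 16 levels
toy_factors <- tibble(
  CurrentJobTitleSelect = factor(c("Statistician", "Data Scientist", "Engineer", "Data Scientist"))
)

job_title_levels <- toy_factors %>%
  pull(CurrentJobTitleSelect) %>%  # column as a plain factor vector
  levels()                         # its levels, in stored order
job_title_levels
```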

##  [1] "Business Analyst"                    
##  [2] "Computer Scientist"                  
##  [3] "Data Analyst"                        
##  [4] "Data Miner"                          
##  [5] "Data Scientist"                      
##  [6] "DBA/Database Engineer"               
##  [7] "Engineer"                            
##  [8] "Machine Learning Engineer"           
##  [9] "Operations Research Practitioner"    
## [10] "Other"                               
## [11] "Predictive Modeler"                  
## [12] "Programmer"                          
## [13] "Researcher"                          
## [14] "Scientist/Researcher"                
## [15] "Software Developer/Software Engineer"
## [16] "Statistician"

Changing the order of factor levels
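
A sketch of the reordering, using the six level names from the output and a made-up vector of responses. fct_relevel() with all levels named puts them in exactly that order.

```r
library(forcats)

# Made-up responses that use the six level names from the survey
tools <- factor(c(
  "Do not know", "Entirely internal", "More external than internal",
  "Approximately half internal and half external",
  "Entirely external", "More internal than external", "Entirely internal"
))

levels(tools)  # alphabetical by default

# Name every level in the order that makes sense for plotting
tools_ordered <- fct_relevel(
  tools,
  "Entirely internal",
  "More internal than external",
  "Approximately half internal and half external",
  "More external than internal",
  "Entirely external",
  "Do not know"
)
levels(tools_ordered)
```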

## [1] "Approximately half internal and half external"
## [2] "Do not know"                                  
## [3] "Entirely external"                            
## [4] "Entirely internal"                            
## [5] "More external than internal"                  
## [6] "More internal than external"
## [1] "Entirely internal"                            
## [2] "More internal than external"                  
## [3] "Approximately half internal and half external"
## [4] "More external than internal"                  
## [5] "Entirely external"                            
## [6] "Do not know"

Tricks of fct_relevel()
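
The exact calls are not shown, so this is one sequence that gets from the first listing to the second, using three behaviours of fct_relevel(): named levels move to the front by default, after = Inf moves a level to the end, and after = n places it after the n-th remaining level. The vector itself is made up.

```r
library(forcats)

# One made-up response per level
education <- factor(c(
  "Bachelor's degree", "Doctoral degree",
  "I did not complete any formal education past high school",
  "I prefer not to answer", "Master's degree", "Professional degree",
  "Some college/university study without earning a bachelor's degree"
))

# 1. Named levels go to the front, in the order given
step1 <- fct_relevel(
  education,
  "I did not complete any formal education past high school",
  "Some college/university study without earning a bachelor's degree"
)

# 2. after = Inf sends a level to the very end
step2 <- fct_relevel(step1, "I prefer not to answer", after = Inf)

# 3. after = 5 places the level after the fifth of the remaining levels
education_ordered <- fct_relevel(step2, "Doctoral degree", after = 5)
levels(education_ordered)
```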

## [1] "Bachelor's degree"                                                
## [2] "Doctoral degree"                                                  
## [3] "I did not complete any formal education past high school"         
## [4] "I prefer not to answer"                                           
## [5] "Master's degree"                                                  
## [6] "Professional degree"                                              
## [7] "Some college/university study without earning a bachelor's degree"
## [1] "I did not complete any formal education past high school"         
## [2] "Some college/university study without earning a bachelor's degree"
## [3] "Bachelor's degree"                                                
## [4] "Master's degree"                                                  
## [5] "Professional degree"                                              
## [6] "Doctoral degree"                                                  
## [7] "I prefer not to answer"

Collapsing factor levels

##  [1] "Business Analyst"                    
##  [2] "Computer Scientist"                  
##  [3] "Data Analyst"                        
##  [4] "Data Miner"                          
##  [5] "Data Scientist"                      
##  [6] "DBA/Database Engineer"               
##  [7] "Engineer"                            
##  [8] "Machine Learning Engineer"           
##  [9] "Operations Research Practitioner"    
## [10] "Other"                               
## [11] "Predictive Modeler"                  
## [12] "Programmer"                          
## [13] "Researcher"                          
## [14] "Scientist/Researcher"                
## [15] "Software Developer/Software Engineer"
## [16] "Statistician"

Collapse the levels of CurrentJobTitleSelect into a new variable, grouped_titles.
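
A sketch with fct_collapse(). The groupings below are illustrative assumptions, not necessarily the ones used originally, and the data are made up. The NA in the toy vector stays NA after collapsing, which is the situation the warning below is about.

```r
library(forcats)

# Made-up titles, including one missing value
titles <- factor(c(
  "Programmer", "Software Developer/Software Engineer", "Data Scientist",
  "Data Analyst", "Scientist/Researcher", "Statistician", NA
))

# new level = c(old levels); levels not mentioned are left as they are
grouped_titles <- fct_collapse(
  titles,
  "Computer Scientist"     = c("Programmer", "Software Developer/Software Engineer"),
  "Researcher"             = "Scientist/Researcher",
  "Data Analyst/Scientist" = c("Data Scientist", "Data Analyst")
)

table(grouped_titles, useNA = "ifany")
```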

## Warning: Factor `grouped_titles` contains implicit NA, consider using
## `forcats::fct_explicit_na`

Lumping variables

Preserving the most common levels
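
A sketch of lumping with fct_lump() on made-up counts: n keeps the n most common levels, prop keeps levels above a given share, and everything else becomes "Other". The thresholds here are arbitrary.

```r
library(forcats)

# Made-up titles: 5 Data Scientists, 3 Engineers, 1 Statistician, 1 Researcher
titles <- factor(c(
  rep("Data Scientist", 5), rep("Engineer", 3), "Statistician", "Researcher"
))

# Keep the 2 most common levels
top2 <- fct_lump(titles, n = 2)
table(top2)

# Keep levels that make up more than 20% of the responses
over_20pct <- fct_lump(titles, prop = 0.2)
table(over_20pct)
```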

Summarizing data
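
A sketch of one possible summary, assuming the LearningPlatformUsefulness* columns are the target: reshape to long form, drop missing answers, shorten the names, and compute the share of respondents who found each platform at least somewhat useful. The data and the answer labels are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up answers for two of the platform columns
toy_platforms <- tibble(
  LearningPlatformUsefulnessKaggle  = c("Very useful", "Somewhat useful", NA, "Very useful"),
  LearningPlatformUsefulnessCourses = c("Not Useful", NA, "Very useful", "Somewhat useful")
)

perc_useful_platform <- toy_platforms %>%
  pivot_longer(everything(), names_to = "learning_platform", values_to = "usefulness") %>%
  filter(!is.na(usefulness)) %>%
  # Strip the shared prefix so only the platform name remains
  mutate(learning_platform = str_remove(learning_platform, "LearningPlatformUsefulness")) %>%
  group_by(learning_platform) %>%
  summarise(perc_useful = mean(usefulness != "Not Useful"))
perc_useful_platform
```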

Creating an initial plot
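
A bare first plot, assuming a summary table with one row per platform. The numbers are made up and perc_useful_platform is a hypothetical name.

```r
library(ggplot2)

# Made-up summary: share of respondents who found each platform useful
perc_useful_platform <- data.frame(
  learning_platform = c("Kaggle", "Courses", "Blogs"),
  perc_useful       = c(0.95, 0.90, 0.85)
)

# One point per platform; categories appear in alphabetical order
initial_plot <- ggplot(perc_useful_platform,
                       aes(x = learning_platform, y = perc_useful)) +
  geom_point()
initial_plot
```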

Tricks of ggplot2
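
Three common refinements, sketched on made-up numbers; which tricks the original used is an assumption. fct_reorder() sorts the categories by value, scales::percent formats the axis, and theme() rotates crowded labels.

```r
library(ggplot2)
library(forcats)

# Made-up summary table
perc_useful_platform <- data.frame(
  learning_platform = c("Kaggle", "Courses", "Blogs"),
  perc_useful       = c(0.95, 0.90, 0.85)
)

# Order platforms by their value instead of alphabetically
perc_useful_platform$learning_platform <- fct_reorder(
  perc_useful_platform$learning_platform,
  perc_useful_platform$perc_useful
)

polished_plot <- ggplot(perc_useful_platform,
                        aes(x = learning_platform, y = perc_useful)) +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +              # 0.9 shown as 90%
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +  # rotate long labels
  labs(x = "Learning platform", y = "Percent finding it useful")
polished_plot
```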

Changing and creating variables with case_when()

case_when() with a single variable
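
A sketch that bins Age into labelled groups. The bands and labels are illustrative assumptions and the ages are made up. Conditions are checked top to bottom; rows that match none of them get NA.

```r
library(dplyr)

toy_ages <- tibble(Age = c(19, 30, 45, 60, NA))  # made-up ages

toy_generations <- toy_ages %>%
  mutate(generation = case_when(
    between(Age, 10, 22) ~ "Gen Z",        # illustrative cut-offs
    between(Age, 23, 37) ~ "Gen Y",
    between(Age, 38, 52) ~ "Gen X",
    between(Age, 53, 71) ~ "Baby Boomer"
  ))
toy_generations
```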

case_when() from multiple variables
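
A sketch that combines two columns in case_when() and then converts a text rating to a number. The category labels and the data are assumptions made for illustration. parse_number() cannot convert "I prefer not to share", so it returns NA with a parsing warning like the one shown below.

```r
library(dplyr)
library(readr)

# Made-up rows; column names follow the survey
toy_jobs <- tibble(
  CurrentJobTitleSelect     = c("Data Analyst", "Data Analyst", "Data Scientist", "Engineer"),
  DataScienceIdentitySelect = c("Yes", "No", "Yes", "Sort of (Explain more)"),
  JobSatisfaction           = c("8", "I prefer not to share", "10 - Highly Satisfied", "5")
)

toy_identity <- toy_jobs %>%
  mutate(
    # Each condition tests both columns; TRUE catches everything left over
    job_identity = case_when(
      CurrentJobTitleSelect == "Data Analyst" & DataScienceIdentitySelect == "Yes" ~ "DS analyst",
      CurrentJobTitleSelect == "Data Analyst" ~ "NDS analyst",
      DataScienceIdentitySelect == "Yes" ~ "DS non-analyst",
      TRUE ~ "NDS non-analyst"
    ),
    # Keep the leading number; non-numeric answers become NA with a warning
    JobSatisfaction = parse_number(JobSatisfaction)
  )

toy_identity %>%
  group_by(job_identity) %>%
  summarise(avg_js = mean(JobSatisfaction, na.rm = TRUE))
```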

## Warning: 108 parsing failures.
## row col expected                actual
##  37  -- a number I prefer not to share
## 115  -- a number I prefer not to share
## 167  -- a number I prefer not to share
## 403  -- a number I prefer not to share
## 427  -- a number I prefer not to share
## ... ... ........ .....................
## See problems(...) for more details.

Case study on Flight Etiquette

  • loading data
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   RespondentID = col_double()
## )
## See spec(...) for full column specifications.
  • take a glimpse so you have all your variables in hand
Data summary
Name                     flying_etiquette
Number of rows           1040
Number of columns        27
_______________________
Column type frequency:
character                26
numeric                  1
________________________
Group variables          None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
How often do you travel by plane? 0 1.00 5 21 0 6 0
Do you ever recline your seat when you fly? 182 0.82 5 19 0 5 0
How tall are you? 182 0.82 4 14 0 20 0
Do you have any children under 18? 189 0.82 2 3 0 2 0
In a row of three seats, who should get to use the two arm rests? 184 0.82 22 59 0 5 0
In a row of two seats, who should get to use the middle arm rest? 184 0.82 19 44 0 5 0
Who should have control over the window shade? 184 0.82 40 59 0 2 0
Is itrude to move to an unsold seat on a plane? 185 0.82 14 19 0 3 0
Generally speaking, is it rude to say more than a few words tothe stranger sitting next to you on a plane? 185 0.82 14 19 0 3 0
On a 6 hour flight from NYC to LA, how many times is it acceptable to get up if you’re not in an aisle seat? 185 0.82 4 38 0 6 0
Under normal circumstances, does a person who reclines their seat during a flight have any obligation to the person sitting behind them? 186 0.82 72 83 0 2 0
Is itrude to recline your seat on a plane? 186 0.82 14 19 0 3 0
Given the opportunity, would you eliminate the possibility of reclining seats on planes entirely? 186 0.82 2 3 0 2 0
Is it rude to ask someone to switch seats with you in order to be closer to friends? 190 0.82 14 19 0 3 0
Is itrude to ask someone to switch seats with you in order to be closer to family? 190 0.82 14 19 0 3 0
Is it rude to wake a passenger up if you are trying to go to the bathroom? 190 0.82 14 19 0 3 0
Is itrude to wake a passenger up if you are trying to walk around? 190 0.82 14 19 0 3 0
In general, is itrude to bring a baby on a plane? 191 0.82 14 19 0 3 0
In general, is it rude to knowingly bring unruly children on a plane? 191 0.82 14 19 0 3 0
Have you ever used personal electronics during take off or landing in violation of a flight attendant’s direction? 191 0.82 2 3 0 2 0
Have you ever smoked a cigarette in an airplane bathroom when it was against the rules? 191 0.82 2 3 0 2 0
Gender 33 0.97 4 6 0 2 0
Age 33 0.97 4 5 0 4 0
Household Income 214 0.79 6 19 0 5 0
Education 39 0.96 15 32 0 5 0
Location (Census Region) 42 0.96 7 18 0 9 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RespondentID 0 1 3432710995 610418.3 3431729581 3432265101 3432671861 3433152970 3436139758 ▇▇▁▁▁
  • Change all character columns into factor columns and remove people who responded “Never” to a question asking if they have flown before.
  • Select columns where “rude” is in the column name.
  • Change the dataset from “wide” to “long”, with the variable names in a column called “response_var” and the values in a column called “value.”
## Warning: attributes are not identical across measure variables;
## they will be dropped
  • Check your data
## Observations: 7,866
## Variables: 2
## $ response_var <chr> "Is itrude to move to an unsold seat on a plane?", "Is...
## $ value        <chr> NA, "No, not rude at all", "No, not rude at all", "No,...
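
The steps above (factors, drop "Never", keep the "rude" columns, wide to long) can be sketched on a made-up four-column table. The answers are invented, toy_flying is a hypothetical name, and the column names copy the dataset's own spelling ("itrude"). gather() on factor columns with different levels gives the "attributes are not identical" warning and returns character values.

```r
library(dplyr)
library(tidyr)

# Made-up respondents; backticks are needed for names with spaces
toy_flying <- tibble(
  `How often do you travel by plane?` = c("Never", "Once a year or less", "Once a month or less"),
  `Is itrude to recline your seat on a plane?` = c(NA, "No, not rude at all", "Yes, somewhat rude"),
  `In general, is itrude to bring a baby on a plane?` = c(NA, "No, not at all rude", "Yes, very rude"),
  Gender = c("Male", "Female", "Male")
)

gathered_toy <- toy_flying %>%
  mutate(across(where(is.character), as.factor)) %>%
  filter(`How often do you travel by plane?` != "Never") %>%
  select(contains("rude")) %>%   # also matches "itrude"
  gather(response_var, value)    # wide to long: one row per answer

glimpse(gathered_toy)
```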

Data preparation and regex

  • Use str_remove to remove everything before and including “rude to” (with the space at the end) in the response_var column.
  • Remove rows with NA in the value column
  • Create a new variable, rude, which is 0 if the value column is “No, not rude at all” or “No, not at all rude” and 1 otherwise.
  • Check your data
  • Summarise the data set into two columns, the question (i.e. response_var), and a new column, perc_rude, the mean of the rude column for each question.
  • Save it as rude_behaviors and then view your new dataset.
  • Examine your data
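
The steps above can be sketched as follows on made-up long-format data. The regex ".*rude to " removes everything up to and including "rude to " (with the trailing space), and perc_rude is the mean of a 0/1 indicator per question.

```r
library(dplyr)
library(stringr)

# Made-up long data in the response_var / value layout
gathered_toy <- tibble(
  response_var = c(
    rep("Is itrude to recline your seat on a plane?", 3),
    rep("In general, is itrude to bring a baby on a plane?", 3)
  ),
  value = c(
    "No, not rude at all", "Yes, somewhat rude", NA,
    "No, not at all rude", "No, not at all rude", "Yes, very rude"
  )
)

rude_behaviors <- gathered_toy %>%
  # Drop everything before and including "rude to "
  mutate(response_var = str_remove(response_var, ".*rude to ")) %>%
  filter(!is.na(value)) %>%
  # 0 for the two "not rude" wordings, 1 for anything else
  mutate(rude = if_else(value %in% c("No, not rude at all", "No, not at all rude"), 0, 1)) %>%
  group_by(response_var) %>%
  summarise(perc_rude = mean(rude))

rude_behaviors
```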